Can coarse-graining introduce long-range correlations in a symbolic sequence?

S. L. Narasimhan, Joseph A. Nathan* and K. P. N. Murthy** 
Solid State Physics Division, Bhabha Atomic Research Center, Mumbai-400085, India 
* Reactor Physics Design Division, Bhabha Atomic Research Center, Mumbai-400085, India 
** Materials Science Division, Indira Gandhi Center for Atomic Research, 
Kalpakkam - 603102, Tamilnadu, India 

Abstract 

We present an exactly solvable mean-field-like theory of correlated ternary sequences, which are
systems with two independent parameters. Depending on the values of these parameters,
the variance of the number of any given symbol shows a linear or a superlinear dependence
on the length of the sequence. We show that the available phase space of the system is made
up of a diffusive region surrounded by a superdiffusive region. Motivated by the fact that the diffusive
portion of the phase space is larger for the ternary than for the binary, we have studied the mapping between
these two. We have identified the region of the ternary phase space, particularly the diffusive part,
that gets mapped into the superdiffusive regime of the binary. This exact mapping implies that
long-range correlation found in a lower dimensional representative sequence may not, in general,
correspond to the correlation properties of the original system.

PACS numbers: 05.40.-a, 02.50.Ga, 87.10.+e



The dynamical behavior of complex systems consisting of a large number of hierarchically
organized subsystems is known to carry the signature of long-range spatio-temporal
correlations (LRC) that might be present in them. A variety of physical [1], biological [2], 
linguistic [3] and even financial systems [2] provide examples of LRC systems. A standard 
method for studying correlations in such a system is to first divide its state space into a finite 
number of distinctly labelled regions, map the sequence of states assumed by the system into 
a sequence of these symbols and then study the statistical properties of this representative 
sequence. 
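The coarse-graining step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the state is taken to be a single real number, the regions are contiguous intervals, and the names `coarse_grain`, `edges` and `trajectory` are hypothetical and do not come from the original work.

```python
from bisect import bisect_right

def coarse_grain(states, edges):
    """Map each real-valued state to the label of the region it falls in.

    `edges` holds the sorted interior boundaries of the regions:
    k boundaries define k + 1 regions, labelled 0 .. k.
    """
    return [bisect_right(edges, s) for s in states]

# Hypothetical trajectory of a system whose state lies in [0, 1).
trajectory = [0.05, 0.40, 0.72, 0.31, 0.95, 0.66]

ternary = coarse_grain(trajectory, [1 / 3, 2 / 3])  # three regions
binary = coarse_grain(trajectory, [0.5])            # two regions
```

The same trajectory yields different symbolic sequences depending on the partition, which is why the choice of partition matters for the statistics studied below.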

Since such a coarse-graining procedure is not expected to lead to a loss of long-range 
correlations in the system, we may choose to divide the state-space into two distinct regions 
and study the statistical properties of the resulting binary sequence. In fact, we generate 
a correlated binary sequence, parametrizable by a known procedure, and try to find the 
parameter values that correspond to the representative sequence. 

Recently, it has been shown [4] that the strength of long-range correlations in a binary
sequence of length $N$ can be characterized by a parameter $\mu \in [-1, 1]$ so that non-trivial
correlations between the symbols correspond to the parameter region $|\mu| > 1/2$. More
specifically, the variance, $\sigma^2(N)$, of the number of a symbol has been shown to have
the form $\sigma^2(N) \propto N^{\alpha}$, where $\alpha = 1$ for $|\mu| < 1/2$ (diffusive) and $\alpha > 1$ for $|\mu| > 1/2$
(superdiffusive) [5]. This exact mean-field-like theory of correlated binary sequences seems
to provide a paradigm for studying the correlational properties of generic symbolic sequences,
including even natural language texts.
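In practice, the exponent $\alpha$ in $\sigma^2(N) \propto N^{\alpha}$ is usually read off as a slope on a log-log plot. The sketch below assumes the variances have already been measured at several lengths; the function name `variance_exponent` is hypothetical, and an ordinary least-squares fit is one simple choice among several possible estimators.

```python
import math

def variance_exponent(lengths, variances):
    """Least-squares slope of log(variance) against log(length)."""
    xs = [math.log(n) for n in lengths]
    ys = [math.log(v) for v in variances]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic check: a variance growing as N**1.5 is superdiffusive.
lengths = [100, 1000, 10000]
alpha = variance_exponent(lengths, [n ** 1.5 for n in lengths])
```

A fitted value close to 1 indicates diffusive behaviour, while a value clearly above 1 indicates superdiffusive behaviour in the sense used here.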

The implicit assumption in this approach is that an LRC system can be represented by 
a correlated binary sequence. There may be no a priori reason why this assumption should 
hold good. For example, it has been argued [6] that we need a minimum of ten letters 
(not less than five letters, in any case) to be able to design a foldable model of amino acid 
sequences. So, the minimum number of symbols required for a sequential representation of
the system (or equivalently, the extent to which the state-space of a system can be coarse-grained)
may depend on the specific behaviour of the system under study. Even if a binary
sequential representation is acceptable, the problem of how to arrive at this representation 
remains if the number of symbols originally associated with the system is odd. 

Thus, it is necessary to examine whether a coarse-graining procedure could introduce long-range
correlation, besides the one that might be present in the sequence already [7]. To this
end, we present an exact mean-field-like theory of correlated ternary sequences and identify
the diffusive subregion of its phase space that gets spuriously mapped into the superdiffusive 
region of the phase space associated with the binary sequence. This exact mapping implies 
that long-range correlation found in a lower dimensional representative sequence may not, 
in general, correspond to the correlation properties of the original system. 

I. TERNARY SEQUENCES 

Let $T_{N,i}$ ($= t_{i-N}\, t_{i-N+1} \cdots t_{i-1}$) denote a subsequence of $N$ ternary symbols, $t \in \{0, 1, 2\}$.
Then the conditional probability, $p(t_i \mid T_{N,i})$, that the $i$th symbol $t_i$ in the sequence will
be 0, 1 or 2 may not only depend on the number of individual symbols but, in general,
may also depend on their specific order in $T_{N,i}$. Ignoring the configuration-dependence of
$p(t_i \mid T_{N,i})$ leads to a solvable mean-field-like theory of these ternary sequences. We define
the conditional probability, $p(t_i = 0 \mid T_{N,i})$, as follows:

$$p(t_i = 0 \mid T_{N,i}) \equiv p_0(T_{N,i}) = \frac{1}{N}\,(n_0 g_0 + n_1 g_1 + n_2 g_2) \tag{1}$$

where $n_0$, $n_1$ and $n_2$ denote the number of 0's, 1's and 2's respectively in the sequence such
that $n_0 + n_1 + n_2 = N$, and $g_0$, $g_1$ and $g_2$ denote the a priori probabilities of choosing the
respective symbols, of which only two are independent because $g_0 + g_1 + g_2 = 1$. We can
parametrize the deviations of $g_t$ from their 'unbiassed' values 1/3 by writing $g_t = (1 + \mu_t)/3$,
where $\mu_{t=0,1,2} \in [-1, 2]$ and $\mu_0 + \mu_1 + \mu_2 = 0$. It is clear that the $\mu$'s are a measure of the
'memory' built into the system which in turn leads to correlations between the symbols.

Since one of the three symbols is definitely going to be found at any given place in the
subsequence, i.e., $\sum_t p_t(T_{N,i}) = 1$, we need only to define $p_1(T_{N,i})$ or $p_2(T_{N,i})$. Now, the
probability that the $i$th symbol is not 0 is given by

$$p_1(T_{N,i}) + p_2(T_{N,i}) = 1 - p_0(T_{N,i}) = q_1 + q_2 \tag{2}$$

where

$$q_1 = \frac{1}{N}\,[n_0 g_1 + n_1 g_2 + n_2 g_0]; \qquad q_2 = \frac{1}{N}\,[n_0 g_2 + n_1 g_0 + n_2 g_1] \tag{3}$$

so that we may identify $p_1(T_{N,i})$ either with $q_1$ or with $q_2$. That is to say, we may either
have $p_1(T_{N,i}) = q_1 = p_0(T^{1}_{N,i})$ or have $p_1(T_{N,i}) = q_2 = p_0(T^{2}_{N,i})$, where we define the
$a$-complementary of $T_{N,i}$ as $T^{a}_{N,i} = t^{a}_{i-N} \cdots t^{a}_{i-1}$, with $t^{a} = (t + a) \pmod 3$. We choose the first
definition:

$$p_1(T_{N,i}) = p_0(T^{1}_{N,i}) = \frac{1}{N}\,(n_0 g_1 + n_1 g_2 + n_2 g_0) \tag{4}$$

$$p_2(T_{N,i}) = p_0(T^{2}_{N,i}) = \frac{1}{N}\,(n_0 g_2 + n_1 g_0 + n_2 g_1) \tag{5}$$

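Clearly, these definitions, Eqs. (1), (4) and (5), ensure that $p_0(T_{N,i}) + p_1(T_{N,i}) + p_2(T_{N,i}) = 1$. The
other choice for $p_1(T_{N,i})$ can be shown to lead to the same result with a simple coordinate
transformation. Taking 0 and 1 as the independent symbols, we can write Eqs. (1) and (4) in
the form,

$$p_0(T_{N,i}) \equiv p_0(n_0, n_1; N) = \frac{1}{3}\left(1 - \frac{\mathcal{N}_0}{N}\,\mu_0 - \frac{\mathcal{N}_1}{N}\,\mu_1\right) \tag{6}$$

$$p_1(T_{N,i}) \equiv p_1(n_0, n_1; N) = \frac{1}{3}\left(1 + \frac{\mathcal{N}_1}{N}\,\mu_0 + \frac{[\mathcal{N}_1 - \mathcal{N}_0]}{N}\,\mu_1\right) \tag{7}$$

where $\mathcal{N}_0 = [N - (2n_0 + n_1)]$ and $\mathcal{N}_1 = [N - (n_0 + 2n_1)]$.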
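As a numerical cross-check, the count form of the conditional probabilities, Eqs. (1), (4) and (5), can be compared with the $\mu$-form of Eqs. (6) and (7). The sketch below is illustrative only; the function names are hypothetical, and the variables `cal_n0` and `cal_n1` stand for the quantities $N - (2n_0 + n_1)$ and $N - (n_0 + 2n_1)$ defined above.

```python
def ternary_probs(n0, n1, n2, mu0, mu1):
    """Conditional probabilities from symbol counts, Eqs. (1), (4), (5)."""
    N = n0 + n1 + n2
    mu = (mu0, mu1, -mu0 - mu1)          # mu_0 + mu_1 + mu_2 = 0
    g = [(1 + m) / 3 for m in mu]        # a priori probabilities g_t
    p0 = (n0 * g[0] + n1 * g[1] + n2 * g[2]) / N
    p1 = (n0 * g[1] + n1 * g[2] + n2 * g[0]) / N
    p2 = (n0 * g[2] + n1 * g[0] + n2 * g[1]) / N
    return p0, p1, p2

def ternary_probs_mu_form(n0, n1, N, mu0, mu1):
    """The same p0 and p1 written as in Eqs. (6) and (7)."""
    cal_n0 = N - (2 * n0 + n1)
    cal_n1 = N - (n0 + 2 * n1)
    p0 = (1 - cal_n0 * mu0 / N - cal_n1 * mu1 / N) / 3
    p1 = (1 + cal_n1 * mu0 / N + (cal_n1 - cal_n0) * mu1 / N) / 3
    return p0, p1

# Example counts and memory parameters (arbitrary illustrative values).
p0, p1, p2 = ternary_probs(5, 2, 3, 0.4, -0.1)
q0, q1 = ternary_probs_mu_form(5, 2, 10, 0.4, -0.1)
```

Both forms agree, and the three probabilities sum to one because each is a convex combination of the $g_t$.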

A ternary sequence of $N$ symbols is completely described by the probability, $Q(n_0, n_1; N)$,
that there are $n_0$ zeroes and $n_1$ ones in the sequence:

$$\begin{aligned} Q(n_0, n_1; N+1) = {} & p_0(n_0 - 1, n_1; N)\, Q(n_0 - 1, n_1; N) \\ & + p_1(n_0, n_1 - 1; N)\, Q(n_0, n_1 - 1; N) \\ & + p_2(n_0, n_1; N)\, Q(n_0, n_1; N) \end{aligned} \tag{8}$$
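The recursion of Eq. (8) can be iterated exactly for short sequences. The sketch below is a minimal illustration, not the authors' procedure: it pushes probability forward from each state (which is equivalent to the gather form of Eq. (8)), and it assumes that the very first symbol is drawn uniformly, since the conditional probabilities are undefined for an empty sequence. The function names are hypothetical.

```python
def evolve_ternary(mu0, mu1, n_max):
    """Exact distribution Q(n0, n1; N = n_max) obtained by iterating Eq. (8)."""
    g = [(1 + mu0) / 3, (1 + mu1) / 3, (1 - mu0 - mu1) / 3]
    # Assumption: the first symbol is 0, 1 or 2 with probability 1/3 each.
    Q = {(1, 0): 1 / 3, (0, 1): 1 / 3, (0, 0): 1 / 3}
    for N in range(1, n_max):
        new_Q = {}
        for (n0, n1), weight in Q.items():
            n2 = N - n0 - n1
            p0 = (n0 * g[0] + n1 * g[1] + n2 * g[2]) / N  # Eq. (1)
            p1 = (n0 * g[1] + n1 * g[2] + n2 * g[0]) / N  # Eq. (4)
            p2 = 1 - p0 - p1                              # Eq. (5)
            moves = (((n0 + 1, n1), p0), ((n0, n1 + 1), p1), ((n0, n1), p2))
            for key, p in moves:
                new_Q[key] = new_Q.get(key, 0.0) + weight * p
        Q = new_Q
    return Q

def mean_zeros(Q):
    """Average number of zeros under the distribution Q."""
    return sum(n0 * w for (n0, _), w in Q.items())

Q30 = evolve_ternary(0.3, -0.2, 30)
```

With the uniform first symbol assumed here, the average number of zeros stays at $N/3$ for every $N$, in line with the absence of a global bias.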

Since the average number of any symbol in the sequence will be $N/3$ asymptotically (i.e.,
no global bias in the system), we rewrite the above equation in terms of the variables
$x = 3n_0 - N$ and $y = 3n_1 - N$. In doing so, we make use of the correspondence
$(n_0, n_1; N+1) \to (x-1, y-1; N+1)$, $(n_0 - 1, n_1; N) \to (x-3, y; N)$ and $(n_0, n_1 - 1; N) \to (x, y-3; N)$,
as can be seen from the definitions of $x$ and $y$:

$$\begin{aligned} Q(x-1, y-1; N+1) = {} & p_0(x-3, y; N)\, Q(x-3, y; N) \\ & + p_1(x, y-3; N)\, Q(x, y-3; N) \\ & + \left(1 - [p_0(x, y; N) + p_1(x, y; N)]\right) Q(x, y; N) \end{aligned} \tag{9}$$

where the probabilities $p_0(x, y; N)$ and $p_1(x, y; N)$ are given by,

$$p_0(x, y; N) = \frac{1}{3}\left(1 + \frac{[2\mu_0 + \mu_1]}{3N}\,x + \frac{[\mu_0 + 2\mu_1]}{3N}\,y\right) \tag{10}$$

$$p_1(x, y; N) = \frac{1}{3}\left(1 + \frac{[\mu_1 - \mu_0]}{3N}\,x - \frac{[2\mu_0 + \mu_1]}{3N}\,y\right) \tag{11}$$



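The continuum version of Eq. (9) is then given by

$$\frac{\partial Q}{\partial N} = \frac{1}{4} D \left(\frac{\partial^2 Q}{\partial x^2} + \frac{\partial^2 Q}{\partial y^2}\right) - \frac{\gamma}{N + N_0}\left(\frac{\partial[(\lambda_0 x + \lambda_1 y)Q]}{\partial x} + \frac{\partial[(\lambda x - \lambda_0 y)Q]}{\partial y}\right) \tag{12}$$

where $D$ and $\gamma$ represent the diffusion and drift constants respectively. We have introduced
the parameter $N_0$ to allow for the possible existence of transient time in the problem. The
$\lambda$'s are defined by,

$$\lambda_0 = \frac{1}{3}(2\mu_0 + \mu_1); \qquad \lambda_1 = \frac{1}{3}(\mu_0 + 2\mu_1); \qquad \lambda = \frac{1}{3}(\mu_1 - \mu_0) \tag{13}$$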

A standard method of solving Eq. (12), with the initial condition $Q(x, y; N = 0) = \delta(x)\delta(y)$,
is to first Fourier transform it with respect to the variables $x$ and $y$:

$$\frac{\partial \tilde{Q}(q_x, q_y; N)}{\partial N} = -\frac{1}{4} D\,(q_x^2 + q_y^2)\,\tilde{Q} + \frac{\gamma}{N + N_0}\left([\lambda_0 q_x + \lambda q_y]\frac{\partial \tilde{Q}}{\partial q_x} + [\lambda_1 q_x - \lambda_0 q_y]\frac{\partial \tilde{Q}}{\partial q_y}\right) \tag{14}$$

where $q_x$ and $q_y$ are the Fourier conjugates of $x$ and $y$ respectively. This first order equation
can then be solved by the method of characteristics. In particular, we have to solve the
equations,

$$\frac{dN}{N + N_0} = \frac{dq_x}{\lambda_0 q_x + \lambda q_y} = \frac{dq_y}{\lambda_1 q_x - \lambda_0 q_y} \tag{15}$$

Considering the second and the third terms leads to the equation,

$$(-\lambda_0 q_y + \lambda_1 q_x)\,dq_x = (\lambda_0 q_x + \lambda q_y)\,dq_y \tag{16}$$

that can be solved for $q_x$ in terms of $q_y$ or vice versa:

$$q_x = \frac{\alpha + 2\lambda_0}{\lambda_1}\,q_y; \quad \text{or} \quad q_y = \frac{\alpha}{\lambda}\,q_x; \quad \text{where} \quad \alpha = -\lambda_0 \pm [\lambda_0^2 + \lambda\lambda_1]^{1/2} \tag{17}$$

Using these relations, we can immediately obtain their $N$-dependence,

$$q_x,\, q_y \propto (N + N_0)^{-\Gamma}; \qquad \Gamma = \pm[\lambda_0^2 + \lambda\lambda_1]^{1/2} \tag{18}$$

which in turn helps us define the variables,



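$$\xi = q_x (N + N_0)^{\Gamma}; \qquad \eta = q_y (N + N_0)^{\Gamma} \tag{19}$$

in terms of which Eq. (14) can be written as

$$\frac{\partial \tilde{Q}(\xi, \eta, \tau)}{\partial \tau} = -\frac{1}{4} D\,(\xi^2 + \eta^2)\,\tau^{-2\Gamma}\,\tilde{Q}(\xi, \eta, \tau) \tag{20}$$

where $\tau = (N + N_0)$. With the initial condition, $\tilde{Q}(q_x, q_y; \tau = N_0) = 1$, we immediately have
the solution,

$$\tilde{Q}(q_x, q_y; \tau) = \exp\left\{-\frac{1}{2}\,\sigma^2(\tau)\,(q_x^2 + q_y^2)\right\} \tag{21}$$

from which we identify the variance,

$$\sigma^2(\tau) = \frac{D\,\tau}{2(1 - 2\Gamma)}\left[1 - \left(\frac{N_0}{\tau}\right)^{-2\Gamma + 1}\right] \tag{22}$$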

It is clear from the above expression that $\sigma^2(\tau) \propto \tau^{\nu}$, where $\nu = 1$ whenever $2\Gamma < 1$ and
$\nu = 2\Gamma$ whenever $2\Gamma > 1$. Even in the case $2\Gamma = 1$, the exponent $\nu = 1$, but there will be
logarithmic corrections.
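The exponent $\nu$ can be evaluated directly from the two memory parameters. The sketch below takes the positive root for $\Gamma$ and uses the definitions of Eq. (13); with those definitions, $\lambda_0^2 + \lambda\lambda_1 = (\mu_0^2 + \mu_0\mu_1 + \mu_1^2)/3$, which is never negative. The function name `ternary_exponent` is hypothetical.

```python
import math

def ternary_exponent(mu0, mu1):
    """Return (Gamma, nu) for a ternary sequence with parameters mu0, mu1."""
    lam0 = (2 * mu0 + mu1) / 3
    lam1 = (mu0 + 2 * mu1) / 3
    lam = (mu1 - mu0) / 3
    # Guard against a tiny negative value caused by rounding.
    gamma = math.sqrt(max(0.0, lam0 ** 2 + lam * lam1))
    nu = max(1.0, 2 * gamma)   # nu = 1 if 2*Gamma <= 1, else nu = 2*Gamma
    return gamma, nu

gamma_weak, nu_weak = ternary_exponent(0.3, 0.3)      # diffusive
gamma_strong, nu_strong = ternary_exponent(1.0, 0.0)  # superdiffusive
```

The logarithmic corrections at $2\Gamma = 1$ are not captured by this simple function, which only returns the leading power.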

The existence of a critical value, $\Gamma_c = 1/2$, for $\Gamma$ implies that the parameter space,
$\{(\mu_0, \mu_1) \mid -1 \le \mu_0, \mu_1 \le 2 \;\&\; -2 \le (\mu_0 + \mu_1) \le 1\}$, is divided into two regions, diffusive
and superdiffusive. As shown in Fig. 1, the diffusive region defined by the condition $\Gamma < 1/2$
is the elliptical region, $\mu_0^2 + \mu_0\mu_1 + \mu_1^2 < 3/4$, inscribed within the triangular phase space.
Interestingly, the area of this diffusive region is $\pi\sqrt{3}/2$, which is roughly 60% of the available
phase-space area. This may be contrasted with the binary case where it is exactly 50%. Now
the question arises whether a diffusive subregion of the ternary is likely to be mapped into
a superdiffusive region of the binary due to a coarse-graining process. Since a mapping of a
set of three symbols to a set of two symbols always introduces a global bias in the resulting
binary system, it is necessary to reformulate the binary case so that we can identify long-range
correlation in a biassed sequence.
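The quoted 60% can be checked with elementary geometry. The sketch below uses the standard area formula $2\pi c/\sqrt{4AC - B^2}$ for an ellipse $A u^2 + B u v + C v^2 = c$, and takes the triangular phase space to have vertices $(-1,-1)$, $(2,-1)$ and $(-1,2)$, as follows from the constraints on the $\mu$'s. The function name is hypothetical.

```python
import math

def diffusive_fraction():
    """Areas of the diffusive ellipse and of the triangle, and their ratio."""
    # Ellipse mu0^2 + mu0*mu1 + mu1^2 = 3/4, i.e. A = B = C = 1, c = 3/4.
    ellipse_area = 2 * math.pi * 0.75 / math.sqrt(4 * 1 * 1 - 1 * 1)
    # Right triangle with two legs of length 3.
    triangle_area = 0.5 * 3.0 * 3.0
    return ellipse_area, triangle_area, ellipse_area / triangle_area

ellipse_area, triangle_area, fraction = diffusive_fraction()
```

The ratio evaluates to about 0.605, compared with exactly 0.5 for the binary case.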

II. BINARY SEQUENCES, REVISITED 

FIG. 1: Phase diagram for the ternary sequence. $L_1$ and $L_2$ refer to the lines $\mu_0 + \mu_1 = 1$ and
$\mu_0 + \mu_1 = 1/4$ respectively. The elliptical region of the triangular phase space corresponds to the
diffusive behavior whereas the region exterior to it corresponds to the superdiffusive behavior of the
ternary sequences. The entire region between the lines $L_1$ and $L_2$, inclusive of the shaded elliptical
region, gets mapped into the superdiffusive regime of the binary.

The conditional probability, $p_0(n_0, N)$, of finding zero as the $(N+1)$th symbol in a binary
sequence that already consists of $n_0$ zeros is given by the definition,

$$p_0(n_0, N) = \frac{1}{N}\,(n_0 g_0 + [N - n_0]\,g_1) \tag{23}$$

where $g_0$ and $g_1 = 1 - g_0$ are the a priori probabilities of choosing symbols zero and one
respectively. If a biassed coin (i.e., $g_0 \neq 1/2$) is used for building up a sequence, then the
average number of zeros in the sequence, $\langle n_0 \rangle$, is expected to be $bN$ asymptotically,
where the intrinsic bias $b$ is equal to $g_0$ if and only if the symbols are added blindly without
reference to the existing sequence (i.e., $p_0(n_0, N)$ does not depend on $n_0$). In order that the
equality $\langle n_0 \rangle = bN$ holds good in general, it is necessary that the dependence of $p_0(n_0, N)$
on $n_0$ is through the variable $x = (n_0/b) - N$. In analogy with the unbiassed case [4], we
write $g_0 = b(1 + \mu)$, where $\mu \in [-1, -1 + (1/b)]$ parametrizes the deviation from the intrinsic
bias of the coin. We then have

$$p_0(x, N) = \alpha + \frac{\beta}{N + N_0}\,x, \quad \text{where} \quad \beta = b(2g_0 - 1); \quad \alpha = \beta - g_0 + 1 \tag{24}$$

In order to check whether the equality $\langle n_0 \rangle = bN$ holds good for $\mu$ in the range
$[-1, -1 + (1/b)]$, we have generated [8] a large number of sequences, each consisting of up to a hundred
thousand symbols, for $b = 2/3$. We find that $\langle n_0 \rangle / N \approx 0.655$ for $\mu = 1/2$ and that it decreases
fast to the value 1/2 for $\mu < 0.3$. On the other hand, the variance $\sigma^2(N) = \langle n_0^2 \rangle - \langle n_0 \rangle^2$
turns out to be proportional to $N^{2(2g_0 - 1)}$ for $g_0 > 3/4$, and to $N$ for $g_0 < 3/4$. This implies
that $\alpha$, defined as above, does not lead to the expected constant value for $\langle n_0 \rangle / N$,
whereas $\beta$ correctly defines the correlation exponent. Since the above definition of $\alpha$ is
derived from Eq. (23) for $p_0(n_0, N)$, we have to check whether Eq. (23) is a valid definition
also for the biassed case. Treating $\alpha$ as a parameter to be fixed later, we can easily show
that the probability, $Q(n_0, N)$, that there are $n_0$ zeros in the sequence satisfies the following
equation in the continuum limit:

$$\frac{\partial Q}{\partial N} = \frac{1}{2b^2}\,\frac{\partial^2 Q}{\partial x^2} + \left(1 - \frac{\alpha}{b}\right)\frac{\partial Q}{\partial x} - \frac{\beta}{b(N + N_0)}\,\frac{\partial [xQ]}{\partial x} \tag{25}$$

Solving this equation, we can show that the distribution, $Q(n_0, N)$, peaks at $\langle n_0 \rangle = (1 + f(\alpha))\,bN$
asymptotically, where $f(\alpha) = (b - \alpha)/[2b(g_0 - 1)]$. Thus, the peak will remain
fixed at $bN$ for all values of $\mu$ only if $\alpha = b$. This leads to the following definition for
$p_0(n_0, N)$:

$$p_0(n_0, N) = b\left[1 + \frac{(2g_0 - 1)}{N}\left(\frac{n_0}{b} - N\right)\right] = 2b(1 - g_0) + (2g_0 - 1)\,\frac{n_0}{N} \tag{26}$$

It is a general definition applicable even to an unbiassed sequence (i.e., it reduces to the
definition, Eq. (23), for $b = 1/2$). The variance of the distribution is given by,

$$\sigma^2(N) \propto \begin{cases} N + N_0 & \text{for } (2g_0 - 1) < 1/2 \\ (N + N_0)^{2(2g_0 - 1)} & \text{for } (2g_0 - 1) > 1/2 \end{cases} \tag{27}$$
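Because the conditional probability is linear in $n_0$, the mean and variance of $n_0$ obey closed one-step recursions that can be iterated without sampling. The sketch below is an illustration under stated assumptions and is not the numerical procedure of [8]: writing $p_0 = a + c\,n_0/N$ with $c = 2g_0 - 1$, the mean evolves as $m \to m(1 + c/N) + a$ and the variance as $v \to v(1 + 2c/N) + \bar{p}(1 - \bar{p})$ with $\bar{p} = a + c\,m/N$; the first symbol is assumed to be zero with probability $g_0$, and $p_0$ is assumed to stay within $[0, 1]$. The function names are hypothetical.

```python
import math

def binary_moments(g0, b, n_max, use_eq26=True):
    """Exact mean and variance of n0 at length n_max.

    use_eq26=True  -> p0 = 2b(1 - g0) + (2 g0 - 1) n0 / N   (Eq. 26)
    use_eq26=False -> p0 = (1 - g0)   + (2 g0 - 1) n0 / N   (Eq. 23)
    """
    c = 2 * g0 - 1
    a = 2 * b * (1 - g0) if use_eq26 else (1 - g0)
    # Assumption: the first symbol is zero with the a priori probability g0.
    mean, var = g0, g0 * (1 - g0)
    for N in range(1, n_max):
        p_bar = a + c * mean / N
        var = var * (1 + 2 * c / N) + p_bar * (1 - p_bar)
        mean = mean * (1 + c / N) + a
    return mean, var

def local_exponent(g0, b, n):
    """Growth exponent of the variance between lengths n and 2n."""
    _, v1 = binary_moments(g0, b, n)
    _, v2 = binary_moments(g0, b, 2 * n)
    return math.log(v2 / v1) / math.log(2)

b = 2 / 3
ratio_eq26 = binary_moments(0.6, b, 20000)[0] / 20000
ratio_eq23 = binary_moments(0.6, b, 20000, use_eq26=False)[0] / 20000
```

In this sketch the fraction of zeros tends to $b$ under Eq. (26) but to 1/2 under Eq. (23), and `local_exponent` approaches 1 for $g_0 < 3/4$ and $2(2g_0 - 1)$ for $g_0 > 3/4$.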

Since $g_0 = b(1 + \mu)$, long-range correlation in the sequence will be characterized by the values
of $\mu$ in the range $\mu \in (-1 + [3/4b], -1 + [1/b])$. Again, we have numerically checked [8] that
the condition $\alpha = b$ ($= 2/3$ in our case) does ensure the constancy of $\langle n_0 \rangle / N$ ($= 2/3$) and,
more importantly, we have confirmed that the variance has the above expected behaviour.
In other words, the LRC behavior of a binary sequence is characterized by the exponent,
$(2g_0 - 1)$, for $g_0 > 3/4$.
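For a given intrinsic bias $b$, the window of $\mu$ that corresponds to long-range correlation follows from $g_0 = b(1 + \mu)$ and the condition $g_0 > 3/4$. The helper names below are hypothetical and the sketch only restates that arithmetic.

```python
def lrc_mu_range(b):
    """Open interval of mu giving long-range correlation for bias b."""
    return (-1 + 3 / (4 * b), -1 + 1 / b)

def binary_is_superdiffusive(mu, b):
    """True when (2 g0 - 1) > 1/2, with g0 = b (1 + mu)."""
    g0 = b * (1 + mu)
    return 2 * g0 - 1 > 0.5

low, high = lrc_mu_range(2 / 3)
```

For $b = 2/3$ the interval is $(1/8, 1/2)$, and for $b = 1/2$ it reduces to $(1/2, 1)$.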



III. MAPPING THE TERNARY INTO THE BINARY 

Assume that the symbols zero and one of the ternary are identified as '0' while the symbol
two is identified as '1'. Then the a priori probabilities for '0' and '1' will be

$$g_0 = \frac{1}{3}(2 + \mu_0 + \mu_1); \qquad g_1 = \frac{1}{3}(1 + \mu_2) \tag{28}$$

The condition for non-trivial correlations then turns out to be $\mu_0 + \mu_1 > 1/4$, above the
lower line shown in Fig. 1. The entire region of the phase space between this line and the
upper line, $\mu_0 + \mu_1 = 1$, corresponds to long-range correlation when mapped into the biassed
binary. This is so because the correlation parameter, $\mu$, for the binary is related to $\mu_0$ and
$\mu_1$ by the identity $\mu = (\mu_0 + \mu_1)/2$, which in turn implies a correspondence between the
range $1/4 < (\mu_0 + \mu_1) < 1$ of the ternary and the range $1/8 < \mu < 1/2$ of the biassed
binary; every line in between and parallel to the upper and lower lines in Fig. 1 is mapped
into a point in the range $1/8 < \mu < 1/2$ and vice versa. In particular, the shaded part
of the ellipse in Fig. 1 is the diffusive area that is mapped into the superdiffusive regime
of the biassed binary, which in general is characterized by a parameter, $\mu$, in the range
$\mu \in (-1 + [3/4b], -1 + [1/b])$.
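The mapping can be turned into a small test for whether a given ternary parameter point is diffusive yet appears superdiffusive after the symbols zero and one are merged. The function names are hypothetical; the sketch simply combines the ellipse condition of Section I with the condition $g_0 > 3/4$ on the coarse-grained binary.

```python
def ternary_is_diffusive(mu0, mu1):
    """Inside the ellipse mu0^2 + mu0*mu1 + mu1^2 < 3/4."""
    return mu0 ** 2 + mu0 * mu1 + mu1 ** 2 < 0.75

def coarse_grained_mu(mu0, mu1):
    """Correlation parameter of the biassed binary, mu = (mu0 + mu1) / 2."""
    return (mu0 + mu1) / 2

def spurious_lrc(mu0, mu1):
    """Diffusive ternary point that maps to a superdiffusive binary."""
    g0_binary = (2 + mu0 + mu1) / 3      # Eq. (28)
    return ternary_is_diffusive(mu0, mu1) and g0_binary > 0.75

inside_shaded = spurious_lrc(0.3, 0.3)   # diffusive, mu0 + mu1 = 0.6 > 1/4
at_origin = spurious_lrc(0.0, 0.0)       # diffusive, but maps to diffusive
outside = spurious_lrc(1.0, 0.0)         # already superdiffusive as ternary
```

Points for which `spurious_lrc` returns `True` make up the shaded part of the ellipse in Fig. 1.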

Conversely, the fact that the correlation parameter of the biassed binary sequence under
study has a value in the range $\mu \in (-1 + [3/4b], -1 + [1/b])$ does not necessarily mean that
the parent ternary sequence also has long-range correlations. Even if we know a priori that
there are long-range correlations between symbols of the parent sequence, the parameter $\mu$
is not a true measure of its strength.

IV. SUMMARY 

We have worked out the exact phase diagram for a ternary sequence with long-range 
correlation. Motivated by the fact that the diffusive portion of the phase space is larger 
for the ternary than for the binary, we have studied the mapping between these two. We 
have shown that long-range correlation for the binary does not necessarily imply long-range 
correlation for the ternary. This exact result has deeper implications for the coarse-graining 
of many-alphabet sequences [8]. For example, if we do not know that the original sequence
has long-range correlations and/or if we do not know the coarse-graining procedure that has
led to the representative sequence under study, then we may not be able to make accurate
inferences about the correlation properties of the original sequence. A systematic numerical
study of this problem will be reported elsewhere. It could be of interest to do a similar study
for a non-Markov sequence.